{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Robust covariance estimation and Mahalanobis distances relevance\n\n\nAn example to show covariance estimation with the Mahalanobis\ndistances on Gaussian distributed data.\n\nFor Gaussian distributed data, the distance of an observation\n$x_i$ to the mode of the distribution can be computed using its\nMahalanobis distance: $d_{(\\mu,\\Sigma)}(x_i)^2 = (x_i -\n\\mu)'\\Sigma^{-1}(x_i - \\mu)$ where $\\mu$ and $\\Sigma$ are\nthe location and the covariance of the underlying Gaussian\ndistribution.\n\nIn practice, $\\mu$ and $\\Sigma$ are replaced by some\nestimates.  The usual covariance maximum likelihood estimate is very\nsensitive to the presence of outliers in the data set and therefor,\nthe corresponding Mahalanobis distances are. One would better have to\nuse a robust estimator of covariance to guarantee that the estimation is\nresistant to \"erroneous\" observations in the data set and that the\nassociated Mahalanobis distances accurately reflect the true\norganisation of the observations.\n\nThe Minimum Covariance Determinant estimator is a robust,\nhigh-breakdown point (i.e. it can be used to estimate the covariance\nmatrix of highly contaminated datasets, up to\n$\\frac{n_\\text{samples}-n_\\text{features}-1}{2}$ outliers)\nestimator of covariance. The idea is to find\n$\\frac{n_\\text{samples}+n_\\text{features}+1}{2}$\nobservations whose empirical covariance has the smallest determinant,\nyielding a \"pure\" subset of observations from which to compute\nstandards estimates of location and covariance.\n\nThe Minimum Covariance Determinant estimator (MCD) has been introduced\nby P.J.Rousseuw in [1].\n\nThis example illustrates how the Mahalanobis distances are affected by\noutlying data: observations drawn from a contaminating distribution\nare not distinguishable from the observations coming from the real,\nGaussian distribution that one may want to work with. Using MCD-based\nMahalanobis distances, the two populations become\ndistinguishable. Associated applications are outliers detection,\nobservations ranking, clustering, ...\nFor visualization purpose, the cubic root of the Mahalanobis distances\nare represented in the boxplot, as Wilson and Hilferty suggest [2]\n\n[1] P. J. Rousseeuw. Least median of squares regression. J. Am\n    Stat Ass, 79:871, 1984.\n[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.\n    Proceedings of the National Academy of Sciences of the United States\n    of America, 17, 684-688.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.covariance import EmpiricalCovariance, MinCovDet\n\nn_samples = 125\nn_outliers = 25\nn_features = 2\n\n# generate data\ngen_cov = np.eye(n_features)\ngen_cov[0, 0] = 2.\nX = np.dot(np.random.randn(n_samples, n_features), gen_cov)\n# add some outliers\noutliers_cov = np.eye(n_features)\noutliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.\nX[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)\n\n# fit a Minimum Covariance Determinant (MCD) robust estimator to data\nrobust_cov = MinCovDet().fit(X)\n\n# compare estimators learnt from the full data set with true parameters\nemp_cov = EmpiricalCovariance().fit(X)\n\n# #############################################################################\n# Display results\nfig = plt.figure()\nplt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)\n\n# Show data set\nsubfig1 = plt.subplot(3, 1, 1)\ninlier_plot = subfig1.scatter(X[:, 0], X[:, 1],\n                              color='black', label='inliers')\noutlier_plot = subfig1.scatter(X[:, 0][-n_outliers:], X[:, 1][-n_outliers:],\n                               color='red', label='outliers')\nsubfig1.set_xlim(subfig1.get_xlim()[0], 11.)\nsubfig1.set_title(\"Mahalanobis distances of a contaminated data set:\")\n\n# Show contours of the distance functions\nxx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),\n                     np.linspace(plt.ylim()[0], plt.ylim()[1], 100))\nzz = np.c_[xx.ravel(), yy.ravel()]\n\nmahal_emp_cov = emp_cov.mahalanobis(zz)\nmahal_emp_cov = mahal_emp_cov.reshape(xx.shape)\nemp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),\n                                  cmap=plt.cm.PuBu_r,\n                                  linestyles='dashed')\n\nmahal_robust_cov = robust_cov.mahalanobis(zz)\nmahal_robust_cov = mahal_robust_cov.reshape(xx.shape)\nrobust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),\n                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')\n\nsubfig1.legend([emp_cov_contour.collections[1], robust_contour.collections[1],\n                inlier_plot, outlier_plot],\n               ['MLE dist', 'robust dist', 'inliers', 'outliers'],\n               loc=\"upper right\", borderaxespad=0)\nplt.xticks(())\nplt.yticks(())\n\n# Plot the scores for each point\nemp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)\nsubfig2 = plt.subplot(2, 2, 3)\nsubfig2.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=.25)\nsubfig2.plot(np.full(n_samples - n_outliers, 1.26),\n             emp_mahal[:-n_outliers], '+k', markeredgewidth=1)\nsubfig2.plot(np.full(n_outliers, 2.26),\n             emp_mahal[-n_outliers:], '+k', markeredgewidth=1)\nsubfig2.axes.set_xticklabels(('inliers', 'outliers'), size=15)\nsubfig2.set_ylabel(r\"$\\sqrt[3]{\\rm{(Mahal. dist.)}}$\", size=16)\nsubfig2.set_title(\"1. from non-robust estimates\\n(Maximum Likelihood)\")\nplt.yticks(())\n\nrobust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)\nsubfig3 = plt.subplot(2, 2, 4)\nsubfig3.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]],\n                widths=.25)\nsubfig3.plot(np.full(n_samples - n_outliers, 1.26),\n             robust_mahal[:-n_outliers], '+k', markeredgewidth=1)\nsubfig3.plot(np.full(n_outliers, 2.26),\n             robust_mahal[-n_outliers:], '+k', markeredgewidth=1)\nsubfig3.axes.set_xticklabels(('inliers', 'outliers'), size=15)\nsubfig3.set_ylabel(r\"$\\sqrt[3]{\\rm{(Mahal. dist.)}}$\", size=16)\nsubfig3.set_title(\"2. from robust estimates\\n(Minimum Covariance Determinant)\")\nplt.yticks(())\n\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}